perf: optimize LTX2 inference latency and implement granular TPU profiling #389

Open
mbohlool wants to merge 1 commit into main from mehdy_perf

Conversation

@mbohlool
Collaborator

Optimize LTX2 inference latency and implement granular TPU profiling

Description

This PR introduces critical performance optimizations and comprehensive profiling infrastructure for the LTX2 video generation pipeline on TPU hardware.

Key Changes

1. Inference Parallelism Optimization (ltx2_video.yml)

Switched from ICI Context Parallelism (ici_context_parallelism: 1) to ICI Data Parallelism (ici_data_parallelism: -1).

  • Impact: Since Classifier-Free Guidance (CFG) generates independent batch items, DP is "embarrassingly parallel" for inference: it requires zero cross-core communication, completely bypassing the All-Gather ICI bottleneck that sequence-sharding incurs.
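
The config change above amounts to flipping two keys. As a sketch (the key names `ici_context_parallelism` and `ici_data_parallelism` come from this PR; the surrounding file structure is an assumption):

```yaml
# ltx2_video.yml (excerpt, hypothetical layout)
# Before: sequence was sharded across ICI cores, forcing All-Gathers.
# After: each core handles an independent CFG batch item end to end.
ici_context_parallelism: 1    # disable sequence-sharding
ici_data_parallelism: -1      # -1 = use all available cores for data parallelism
```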

2. Granular XLA Profiling Annotations (ltx2_pipeline.py)

Injected jax.named_scope wrappers around all major TPU-bound compute blocks (Connectors, Video VAE, Audio VAE, Vocoder).

  • Impact: This prevents massive operations from appearing as unlabeled blobs in the Cloud TPU Profiler (xprof), enabling accurate FLOPs tracking and roofline analysis for individual components outside of the main denoising loop.
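
A minimal sketch of the annotation pattern, using `jax.named_scope` (a real JAX context manager). The stage names and the toy compute are assumptions for illustration; the actual pipeline wraps the Connectors, Video VAE, Audio VAE, and Vocoder this way:

```python
import jax
import jax.numpy as jnp

@jax.jit
def decode_stage(latents):
    # Ops traced inside a named_scope carry that label in the xprof trace,
    # instead of showing up as an unnamed fused blob.
    with jax.named_scope("video_vae_decode"):
        x = jnp.tanh(latents) * 2.0
    with jax.named_scope("vocoder"):
        y = jnp.sum(x, axis=-1)
    return y

out = decode_stage(jnp.ones((2, 8)))
print(out.shape)  # (2,)
```

Because `named_scope` only tags the trace metadata, it adds no runtime cost and does not change the compiled computation.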

3. Execution Timing & Benchmarking (generate_ltx2.py & ltx2_pipeline.py)

Added synchronous jax.block_until_ready() wrappers at the boundaries of major pipeline stages to accurately measure execution time without asynchronous JAX dispatch artifacts.

  • Impact: Restructured the generation script into a 3-pass strategy (Warmup, Generation, Profiling). With the warmup pass absorbing JIT compilation, skip_first_n_steps_for_profiler: 0 lets the profiling pass capture pure steady-state execution latency, fully isolated from compilation overhead.
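
The timing pattern described above can be sketched as follows. The stage function and shapes are placeholders; the key mechanics are `block_until_ready()` (JAX dispatch is asynchronous, so without it `perf_counter` would only measure enqueue time) and a warmup call that absorbs JIT compilation:

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def run_stage(x):
    # Stand-in for one pipeline stage (e.g. a VAE decode).
    return jnp.dot(x, x.T)

x = jnp.ones((256, 256))

# Pass 1 (warmup): triggers JIT compilation; its latency is discarded.
run_stage(x).block_until_ready()

# Pass 2 (generation): block_until_ready() makes the wall-clock delta
# reflect actual device execution, not just async dispatch.
t0 = time.perf_counter()
out = run_stage(x)
out.block_until_ready()
elapsed = time.perf_counter() - t0
print(f"stage latency: {elapsed * 1000:.2f} ms")

# Pass 3 (profiling) would wrap the same call in jax.profiler.trace(...);
# since compilation already happened in pass 1, no steps need skipping.
```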

@mbohlool mbohlool requested a review from entrpn as a code owner April 23, 2026 00:20
@github-actions

- Switched to DP (ici_data_parallelism: -1) in ltx2 config to bypass ICI communication overhead during inference.
- Added `jax.named_scope` around connectors and VAE blocks for accurate xprof trace attribution.
- Added synchronous `perf_counter` wrappers in the pipeline to measure true stage latencies.
- Implemented a 3-pass (warmup, run, profile) generation loop in `generate_ltx2.py` to isolate JIT compilation time and enable cleaner profiling.
@Perseus14
Collaborator

@mbohlool Could you add a table with the latency gain (single video and amortized throughput) of this change over the baseline (main)?

Thanks!
